#rmarkdown::render(paste0("~/Documents/wgodwin28.github.io/impressions_model.Rmd"))
# clear memory
rm(list=ls())
#load libs
library(data.table)
library(ggplot2)
library(scales)
library(lubridate)
library(VGAM) #multinomial logistic
library(caret) #confusion matrix
library(randomForest) #random forest
library(C50) #boosted c5.0
library(tidyverse)
The goal is to build a model that predicts “impressions”, or the number of people the ad reached, using relevant advertisement covariates. I’ll use political advertisement spending data released by google. The dataset contains over 200,000 unique ads since June 2018 with relevant metadata including number of impressions as a ordinal categorical variable. Intially, I need to figure out useful predictors in the model by investigating the univariate distributions and basic bivariate relationships.
#Downloaded from here: https://transparencyreport.google.com/political-ads/region/US?hl=en
#read in data
dt <- fread("~/Desktop/google_ads/data/google-political-ads-creative-stats.csv")
#glimpse(dt)
#relevant variables
vars_keep <- c("Ad_Type", "Regions", "Advertiser_ID", "Date_Range_Start", "Date_Range_End",
"Num_of_Days", "Impressions", "Spend_Range_Min_USD", "Spend_Range_Max_USD")
#subset to variables of interest and create useful variables
dt <- dt %>%
select(vars_keep) %>% #keep relevant variables
mutate(startYear = substr(Date_Range_Start,1,4), #variable indicating year ad started
Region = ifelse(grepl("EU", Regions), "EU", "US") %>% as.factor(), #create cleaner regions variable
month_year = format(as.Date(Date_Range_Start), "%m-%Y"),
week_start = floor_date(as.Date(Date_Range_Start), unit = "week"),
week_end = floor_date(as.Date(Date_Range_End), unit = "week"),
cost_cat = paste0(Spend_Range_Min_USD, "-", Spend_Range_Max_USD) %>%
factor(levels=c("0-100", "100-1000", "1000-50000", "50000-100000", "100000-NA"),
labels = c("0-100", "100-1k", "1k-50k", "50k-100k", "100k+")),
Spend_Range_Min_USD = as.factor(Spend_Range_Min_USD),
Ad_Type = as.factor(Ad_Type),
Impressions = factor(Impressions, levels = c("≤ 10k", "10k-100k", "100k-1M", "1M-10M", "> 10M"),
labels = c("Under 10k", "10k-100k", "100k-1M", "1M-10M", "10M+")))
#filter(regions=="US") #only include US data
dt <- dt %>%
mutate(
case_when(
cost_cat=="100000-NA" ~ "100k+"
)
)
Load in the dataset and prep it for visualizing and modeling.
#impressions
ggplot(dt, aes(Impressions)) +
geom_bar() +
geom_text(stat = 'count', aes(label = percent(..count../nrow(dt)), vjust = -0.2)) +
theme_bw()
Most ads are classified under the “Under 10k” category, indicating an imbalanced classification problem. Due to this imbalance, I’ll plot impression counts on the log scale to observe differences at smaller frequecies.
One might presume that money spent for an ad would be an excellent predictor of impressions, given that that most tech platforms follow an advertising model of more money spent = more exposure. Google provides dollars spent as a categorical variable of 5 bins. Let’s see how well ad dollars spent and number of impressions correlate.
table(dt$Impressions, dt$cost_cat)
##
## 0-100 100-1k 1k-50k 50k-100k 100k+
## Under 10k 151413 9007 533 0 0
## 10k-100k 15999 14049 4147 1 0
## 100k-1M 2193 5591 7312 33 8
## 1M-10M 14 549 1857 58 23
## 10M+ 0 0 239 27 27
Since “ad cost” and “impressions” variables are both ordinal, 5-bin variables, if one was to map the categories to each other on 1 to 1 basis, we could calculate accuracy of a model that solely used “ad cost” to predict “impressions”. And turns out, it would be correct 81.1% of the time! This tells us two things: money spent on ads is a primary driver of total impressions but it’s not the only driver. Maybe we can find other predictors in the data set that could improve our future model’s accuracy.
#time series plot of ad counts across impression category
dt %>%
group_by(date=as.Date(week_start), Impressions) %>%
summarize(weekly_ad_count=n()) %>%
ggplot(aes(date, weekly_ad_count, color=Impressions)) +
geom_point() +
geom_smooth(method = "loess") +
xlab("Date") +
ylab("Number of Ads (on log scale)") +
geom_vline(xintercept = as.Date("2018/11/06"), linetype=4) +
geom_text(aes(x=as.Date("2018/10/28"), y=9400, label=" Midterm"),
colour="blue", angle=90, text=element_text(size=11)) +
scale_y_log10() +
scale_x_date(labels = date_format("%m/%Y"), breaks = date_breaks("2 month")) +
scale_color_discrete(name = "Impressions") +
theme_bw()
In order to assess whether date of ad may be related to number of impressions, I aggregated the number of ads by week of ad start date and impression category. Then I plotted the number of ads across the full time series, colored by impression category, and fit a loess smoother through each. The loess, in this case, is a helpful way to explore time trends in the data.
There does seem to be some relationship, albeit non-linear, between number of ads and ad start date. Ad counts, regardless of impression category, appear to increase up the the 2018 midterm election. However, the time trend is similar across the different impression categories-note the similar shapes for each loess curve. In other words, the relative fraction each impression category comprises, with reference to the impression envelope, does not change much over time. So I’ll hold off on including a time variable in my model for now.
#plot impression counts by region
ggplot(dt, aes(Impressions, fill=Region)) +
geom_bar(position = "dodge") +
scale_y_log10() +
ylab("Count (log 10 scale)") +
theme_bw()
Above is a bar chart showing counts of each impression category across the two regions: EU and US. Note that these frequencies were plotted on a log 10 scale in order to illustrate relative frequencies within the lower frequency categories (100k-1M impressions, etc). The EU has fewer ads within each category and appears to have a greater proportion of its ads with >10k impressions compared to the U.S., which indicates that this variable may be a useful covariate in the model.
#plot impression counts by type of ad
ggplot(dt, aes(Impressions, fill=Ad_Type)) +
geom_bar(position = "dodge") +
scale_y_log10() +
ylab("Count (log 10 scale)") +
theme_bw()
This figure shows counts (on log 10 scale) of ads across impression category for each type of ad. “Image” ads remain the most popular type across impression category, while “Text” ads significantly decrease as number of impressions increase, relatively to other ad types.
dt %>%
mutate(days_running=difftime(Date_Range_End, Date_Range_Start, units = "days")) %>%
ggplot(aes(Impressions, days_running)) +
geom_boxplot() +
ylab("Number of Days Aired") +
theme_bw()
This figure shows box and whiskers of number of days ad was aired across impression category. While these data do look noisy, we see indication of a potential trend: as ad air time increases so does the number of impressions.
Based on the exploratory plots and tables, the covariates we’ll use to predict impressions are: cost of the ad, ad type (text, video, or image), region the ad aired (U.S. or E.U.), and number of days the ad aired.
#model reponse and covariates
response <- "Impressions"
covariates <- c("Num_of_Days", "Region", "Ad_Type", "cost_cat")
#create training data frame
train <- sample(c(TRUE, FALSE), nrow(dt), rep=TRUE)
dt.train <- dt %>%
select(response, covariates) %>%
filter(train)
dt.test <- dt %>%
select(response, covariates) %>%
filter(!train)
#extract the testing response and predictors for prediction
x.test <- dt.test %>% select(covariates)
y <- dt.test %>% select(response)
#create standard formula object
mod.formula <- as.formula(paste(response,
paste(covariates, collapse = "+"), sep = " ~ "))
We’ll model impressions using 3 different methods: logistic regression and two tree-based methods. Logistic is helpful as a first pass since it generally performs well and produces interpretable coefficients. The data will be ramdomly split into a training and testing set for evaluation of model performance.
#Build the model
model1 <- vglm(formula = mod.formula,
data = dt.train, family = "multinomial")
#extract model summary
summary(model1)
##
## Call:
## vglm(formula = mod.formula, family = "multinomial", data = dt.train)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,5]) -321.35 0.017986 0.149250 0.3068891 11.66
## log(mu[,2]/mu[,5]) -294.34 -0.268472 -0.209857 -0.0234441 28.93
## log(mu[,3]/mu[,5]) -61.13 -0.088700 -0.065751 -0.0088781 621.84
## log(mu[,4]/mu[,5]) -53.78 -0.006658 -0.003166 -0.0001966 203.97
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 2.001e+01 5.678e+01 NA NA
## (Intercept):2 1.783e+01 5.678e+01 0.314 0.753486
## (Intercept):3 1.639e+01 5.678e+01 0.289 0.772893
## (Intercept):4 1.183e+01 5.678e+01 0.208 0.834978
## Num_of_Days:1 -2.870e-02 1.395e-03 -20.570 < 2e-16 ***
## Num_of_Days:2 -1.119e-02 1.312e-03 -8.526 < 2e-16 ***
## Num_of_Days:3 -7.109e-03 1.251e-03 -5.684 1.32e-08 ***
## Num_of_Days:4 -3.746e-03 1.243e-03 -3.012 0.002591 **
## RegionUS:1 2.468e+00 2.291e-01 10.774 < 2e-16 ***
## RegionUS:2 3.014e+00 2.288e-01 13.172 < 2e-16 ***
## RegionUS:3 1.893e+00 2.254e-01 8.400 < 2e-16 ***
## RegionUS:4 6.959e-01 2.264e-01 3.074 0.002115 **
## Ad_TypeText:1 1.559e+01 8.886e-01 17.542 < 2e-16 ***
## Ad_TypeText:2 1.036e+01 8.863e-01 11.688 < 2e-16 ***
## Ad_TypeText:3 5.909e+00 8.842e-01 6.683 2.35e-11 ***
## Ad_TypeText:4 3.017e+00 8.616e-01 3.502 0.000462 ***
## Ad_TypeVideo:1 5.582e+00 2.068e-01 26.984 < 2e-16 ***
## Ad_TypeVideo:2 3.985e+00 2.038e-01 19.549 < 2e-16 ***
## Ad_TypeVideo:3 1.935e+00 1.978e-01 9.787 < 2e-16 ***
## Ad_TypeVideo:4 6.630e-01 2.023e-01 3.278 0.001046 **
## cost_cat100-1k:1 -5.828e+00 1.065e+02 -0.055 0.956346
## cost_cat100-1k:2 -1.297e+00 1.065e+02 -0.012 0.990283
## cost_cat100-1k:3 8.575e-01 1.065e+02 0.008 0.993573
## cost_cat100-1k:4 3.943e+00 1.065e+02 0.037 0.970458
## cost_cat1k-50k:1 -2.854e+01 5.678e+01 -0.503 0.615181
## cost_cat1k-50k:2 -2.082e+01 5.678e+01 -0.367 0.713857
## cost_cat1k-50k:3 -1.543e+01 5.678e+01 -0.272 0.785841
## cost_cat1k-50k:4 -1.047e+01 5.678e+01 -0.184 0.853698
## cost_cat50k-100k:1 -5.398e+01 5.773e+03 NA NA
## cost_cat50k-100k:2 -4.778e+01 5.313e+03 NA NA
## cost_cat50k-100k:3 -2.050e+01 5.678e+01 -0.361 0.718112
## cost_cat50k-100k:4 -1.287e+01 5.678e+01 -0.227 0.820680
## cost_cat100k+:1 -5.538e+01 7.978e+03 NA NA
## cost_cat100k+:2 -4.949e+01 7.469e+03 NA NA
## cost_cat100k+:3 -2.299e+01 5.679e+01 -0.405 0.685567
## cost_cat100k+:4 -1.366e+01 5.678e+01 -0.241 0.809938
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 90046.23 on 425752 degrees of freedom
##
## Log-likelihood: -45023.11 on 425752 degrees of freedom
##
## Number of Fisher scoring iterations: 21
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', 'cost_cat50k-100k:1', 'cost_cat50k-100k:2', 'cost_cat100k+:1', 'cost_cat100k+:2'
##
##
## Reference group is level 5 of the response
#Predict using the model
probability <- predict(model1, x.test, type="response")
dt.test <- dt.test %>%
mutate(predicted_cat = apply(probability, 1, which.max),
predicted_name = case_when(predicted_cat==1 ~ "Under 10k",
predicted_cat==2 ~ "10k-100k",
predicted_cat==3 ~ "100k-1M",
predicted_cat==4 ~ "1M-10M",
predicted_cat==5 ~ "10M+"),
predicted_name = factor(predicted_name,
levels = c("Under 10k", "10k-100k", "100k-1M", "1M-10M", "10M+")))
#Accuracy of the model
mtab <- table(dt.test$predicted_name, dt.test$Impressions)
confusionMatrix(mtab)
## Confusion Matrix and Statistics
##
##
## Under 10k 10k-100k 100k-1M 1M-10M 10M+
## Under 10k 79387 8358 1110 12 0
## 10k-100k 1257 7403 2334 161 1
## 100k-1M 8 1252 4048 956 105
## 1M-10M 0 0 69 129 14
## 10M+ 0 0 0 8 21
##
## Overall Statistics
##
## Accuracy : 0.8533
## 95% CI : (0.8511, 0.8554)
## No Information Rate : 0.7564
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5793
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity 0.9843 0.43514 0.53538
## Specificity 0.6351 0.95812 0.97657
## Pos Pred Value 0.8933 0.66359 0.63558
## Neg Pred Value 0.9288 0.89935 0.96496
## Prevalence 0.7564 0.15955 0.07091
## Detection Rate 0.7445 0.06943 0.03796
## Detection Prevalence 0.8334 0.10462 0.05973
## Balanced Accuracy 0.8097 0.69663 0.75598
## Class: 1M-10M Class: 10M+
## Sensitivity 0.101896 0.1489362
## Specificity 0.999212 0.9999249
## Pos Pred Value 0.608491 0.7241379
## Neg Pred Value 0.989316 0.9988743
## Prevalence 0.011872 0.0013223
## Detection Rate 0.001210 0.0001969
## Detection Prevalence 0.001988 0.0002720
## Balanced Accuracy 0.550554 0.5744305
The confusion matrix shows the model predictions (row-wise) stacked against the actual data (column-wise). If the model fit the data perfectly, we’d only see values along the diagonal and would see zeros everywhere else. The overall accuracy is 85%, indicating that the model correctly labels the testing data 85% of the time. Sensitivity (we’ll use “recall”) and specificity varies across the impression categories, as we’d expect. When predicting “Under 10k” impressions, a recall of 0.98 indicates that the model get 98% of the actual “Under 10k” impressions correct. The relatively poor specificity indicates that the model correctly predicts that an ad will NOT get “Under 10k” impressions 63% of the time. We see the opposite result from the “10M+” impressions category. Because we have an imbalanced classification problem with so few “10M+” impression ads, the model can predict that an ad will not get “10M+” impressions with 99.99% confidence. However, recall of 0.13 reveals that 87% of actual “10M+” impression ads are incorrectly labeled by the model.
Model evaluation can be based on overall accuracy of the model or on more specific metrics like precision or recall, depending on the research question. For instance, consider that the goal of many political action committees (PAC) is to reach as many people as possible using as little money as possible. In order to figure out how to do this, they could start with looking at what covariates are conditionally associated with number of impressions from the logistic model coefficients. We can see that even controlling for ad cost, an ad with longer air time tends to achieve more impression. Going even further, they could build a model that optimizes for their main goal-accurately predicting ads with millions of impressions or recall. As we noted, the logistic model above has poor recall for impression categories of particular interest as it correctly labels a “1-10M” impression ad only 10% of the time and a “10M+” impression ad only 13% of the time. Compare this to our original “model, that simply used the cost of ad to predict impressions, which had a recall for”10M+" of 9%. Both models leave room for improvement. Let’s see if we can improve the recall at the high impression end using a tree-based models.
#Build the model
model2 <- randomForest(mod.formula, data = dt.train)
#Summarize the model
summary(model2)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 106447 factor numeric
## err.rate 3000 -none- numeric
## confusion 30 -none- numeric
## votes 532235 matrix numeric
## oob.times 106447 -none- numeric
## classes 5 -none- character
## importance 4 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 106447 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#Predict using the model
dt.test$pred_randomforest <- predict(model2, x.test)
#Accuracy of the model
mtab2 <- table(dt.test$pred_randomforest, dt.test$Impressions)
confusionMatrix(mtab2)
## Confusion Matrix and Statistics
##
##
## Under 10k 10k-100k 100k-1M 1M-10M 10M+
## Under 10k 79596 8513 1136 12 0
## 10k-100k 1043 7006 1932 154 1
## 100k-1M 13 1492 4376 884 56
## 1M-10M 0 2 113 209 42
## 10M+ 0 0 4 7 42
##
## Overall Statistics
##
## Accuracy : 0.8555
## 95% CI : (0.8534, 0.8576)
## No Information Rate : 0.7564
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5839
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity 0.9869 0.41180 0.57876
## Specificity 0.6282 0.96507 0.97532
## Pos Pred Value 0.8918 0.69120 0.64155
## Neg Pred Value 0.9392 0.89630 0.96809
## Prevalence 0.7564 0.15955 0.07091
## Detection Rate 0.7464 0.06570 0.04104
## Detection Prevalence 0.8370 0.09506 0.06397
## Balanced Accuracy 0.8075 0.68844 0.77704
## Class: 1M-10M Class: 10M+
## Sensitivity 0.165087 0.2978723
## Specificity 0.998510 0.9998967
## Pos Pred Value 0.571038 0.7924528
## Neg Pred Value 0.990053 0.9990711
## Prevalence 0.011872 0.0013223
## Detection Rate 0.001960 0.0003939
## Detection Prevalence 0.003432 0.0004970
## Balanced Accuracy 0.581798 0.6488845
Compared to multinomial logistic regression, random forest does perform better for lower frequency classes like “1-10M” and “10M+”. Despite doubling the recall observed in logistic regression, a model that correctly labels an ad to get “10M+” impressions only 30% of the time leaves a lot to be desired. Outside of the scope of this analysis but in an effort to improve recall, we could oversample the minority impression categories. This has the effect of balancing out the distribution of impressions and would likely improve prediction in minority categories. The overall accuracy of the random forest model is similar to the logisitic regression at 85%, indicating that the gains made in recall for high impression classes may have come at cost for other components of model performance.
#Build the model
model3 <- C5.0(mod.formula, data = dt.train, trials = 8)
#Predict using the model
dt.test$pred_c50 <- predict(model3, x.test)
#Accuracy of the model
mtab3 <- table(dt.test$pred_c50, dt.test$Impression)
confusionMatrix(mtab3)
## Confusion Matrix and Statistics
##
##
## Under 10k 10k-100k 100k-1M 1M-10M 10M+
## Under 10k 79563 8474 1129 12 0
## 10k-100k 1023 6051 1476 88 1
## 100k-1M 66 2488 4951 1116 116
## 1M-10M 0 0 5 50 24
## 10M+ 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.8498
## 95% CI : (0.8476, 0.8519)
## No Information Rate : 0.7564
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5692
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity 0.9865 0.35567 0.65481
## Specificity 0.6299 0.97112 0.96179
## Pos Pred Value 0.8922 0.70043 0.56667
## Neg Pred Value 0.9376 0.88814 0.97334
## Prevalence 0.7564 0.15955 0.07091
## Detection Rate 0.7461 0.05675 0.04643
## Detection Prevalence 0.8363 0.08102 0.08194
## Balanced Accuracy 0.8082 0.66340 0.80830
## Class: 1M-10M Class: 10M+
## Sensitivity 0.0394945 0.000000
## Specificity 0.9997248 1.000000
## Pos Pred Value 0.6329114 NaN
## Neg Pred Value 0.9885879 0.998678
## Prevalence 0.0118725 0.001322
## Detection Rate 0.0004689 0.000000
## Detection Prevalence 0.0007409 0.000000
## Balanced Accuracy 0.5196096 0.500000
A Boosted C5.0 model is based on simple tree-based framework that uses “boosting” methods. While a random forest splits the predictor space on into partitions that minimize impurity/maximize information criterion for each independent tree, boosting models grow trees sequentially with the residuals of the previous tree becoming the response variable of the subsequent tree. While this smoothing over residuals may sometimes improve model performance, in this context, the random forest performed slightly better overall.
There are a variety of other models one could use to classify impressions from naive Bayes to support vector machines, which could lead to improved overall accuracy and improved recall. There’s also feature engineering that we didn’t investigate at length (like lagged variables or midterm-related associations). Those pursuits are fodder for future projects. The takeaway from this analysis is that logistic regression, while some times not as accurate, still can construct a useful springboard for further analyses due to its interpretability. And regression trees-random forests and boosting methods-can be fast, flexible frameworks for optimizing toward a specific performance metric.